Learning-based text classifiers using the Mahalanobis distance for correlated datasets

Authors

  • Noopur Srivastava
  • Shrisha Rao
Abstract

We present a novel approach to text categorisation that uses the Mahalanobis distance measure for classification. For correlated datasets, classification using the Euclidean distance is not very accurate, whereas the Mahalanobis distance exploits the correlation in the data. To achieve this on large datasets, an unsupervised dimensionality reduction technique, Principal Component Analysis (PCA), is applied prior to classification with the k-nearest neighbours (kNN) classifier. Because kNN does not work well for high-dimensional data, and computing correlations for large, sparse data is inefficient, we use PCA to obtain a reduced dataset for the training phase. Experimental results on large datasets show an improvement in classification accuracy and a significant reduction in error percentage with the proposed algorithm, in comparison with classifiers using the Euclidean distance.
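The pipeline described in the abstract (PCA for dimensionality reduction, then kNN under the Mahalanobis distance) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `pca_fit`, `pca_transform`, and `mahalanobis_knn_predict`, the use of a single covariance matrix estimated from the whole reduced training set, the majority-vote rule, and the toy data are all assumptions made here, since the abstract does not specify these details.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (as columns)."""
    mean = X.mean(axis=0)
    # SVD of the centred data; the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def pca_transform(X, mean, components):
    """Project data onto the principal axes learned from the training set."""
    return (X - mean) @ components

def mahalanobis_knn_predict(Z_train, y_train, Z_test, k=3):
    """Label each row of Z_test by majority vote among its k nearest
    training rows under the Mahalanobis distance.

    Assumption: one covariance matrix estimated from all of Z_train.
    """
    cov = np.atleast_2d(np.cov(Z_train, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    preds = []
    for z in Z_test:
        diff = Z_train - z
        # Squared Mahalanobis distance to every training row: diff_i^T C^{-1} diff_i
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        nearest = np.argsort(d2)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Hypothetical toy data standing in for document feature vectors.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=0.0, size=(20, 6)),
    rng.normal(loc=5.0, size=(20, 6)),
])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.vstack([np.zeros((1, 6)), np.full((1, 6), 5.0)])

mean, components = pca_fit(X_train, n_components=2)
Z_train = pca_transform(X_train, mean, components)
Z_test = pca_transform(X_test, mean, components)
print(mahalanobis_knn_predict(Z_train, y_train, Z_test, k=3))
```

One consequence of this particular sketch is worth noting: PCA decorrelates the training data, so when the covariance is estimated from the same reduced training set, the Mahalanobis distance amounts to a Euclidean distance on variance-scaled components. A per-class covariance estimate would behave differently, and the abstract does not say which variant the paper uses.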

Similar resources

A Pre-Trained Ensemble Model for Breast Cancer Grade Detection Based on Small Datasets

Background and Purpose: Nowadays, breast cancer is reported as one of the most common cancers amongst women. Early detection of the cancer type is essential to inform subsequent treatments. The newest proposed breast cancer detectors are based on deep learning. Most of these works focus on large datasets and are not developed for small datasets. Although the large datasets might lead ...

Identifying Useful Variables for Vehicle Braking Using the Adjoint Matrix Approach to the Mahalanobis-Taguchi System

The Mahalanobis Taguchi System (MTS) is a diagnosis and forecasting method for multivariate data. Mahalanobis distance (MD) is a measure based on correlations between the variables and different patterns that can be identified and analyzed with respect to a base or reference group. MTS is of interest because of its reported accuracy in forecasting small, correlated data sets. This is the type o...

An investigation on scaling parameter and distance metrics in semi-supervised Fuzzy c-means

The scaling parameter α helps maintain a balance between supervised and unsupervised learning in semi-supervised Fuzzy c-Means (ssFCM). In this study, we investigated the effects of different α values, 0.1, 0.5, 1 and 10 in Pedrycz and Waletsky’s ssFCM with various amounts of labelled data, 10%, 20%, 30%, 40%, 50% and 60% and three distance metrics, Euclidean, Mahalanobis and kernel-based on th...

Applying the Mahalanobis-Taguchi System to Vehicle Ride

The Mahalanobis Taguchi System is a diagnosis and forecasting method for multivariate data. Mahalanobis distance is a measure based on correlations between the variables and different patterns that can be identified and analyzed with respect to a base or reference group. The Mahalanobis Taguchi System is of interest because of its reported accuracy in forecasting small, correlated data sets. Th...

Context Based Covariance for Supervised Learning

This paper proposes a new metric, called Context Based Covariance, to capture contextual information intrinsic to multivariate data. Based on this concept, a minimum distance classifier is designed, and its applicability to the domain of supervised machine learning is discussed. The performance of the proposed metric is compared with conventional minimum distance classifiers based on Mahalanobi...

Journal title:
  • IJBDI

Volume 3, Issue

Pages -

Publication date 2016